Homework 5: Disciplined Convex Programming and Data Fitting
Instructions
Please submit a .qmd file along with a rendered pdf to the Brightspace page for this assignment. You may use whatever language you like within your qmd file, I recommend python, julia, or R.
Problem 1: Multi-Label Support Vector Machine (CVX Additional Exercises 6.18)
The basic SVM described in chapter 8 of the book is used for classification of data with two labels. In this problem we explore an extension of SVM that can be used to carry out classification of data with more than two labels. Our data consists of pairs \((\mathbf{x}_i , y_i ) \in \mathbf{R}^n \times \{1, \dots , K\},\,\, i = 1, \dots , m\), where \(\mathbf{x _i\) is the feature vector and \(y_i\) is the label of the \(i\)th data point. (So the labels can take the values \(1, \dots , K\).) Our classifier will use \(K\) affine functions, \(f_k(\mathbf{x}) = \mathbf{a}^T_k \mathbf{x} + \mathbf{b}_k\) , \(k = 1, . . . , K\), which we also collect into affine function from $^n into ^K as $f() = A + . (The rows of \(A\) are \(\mathbf{a}^T_k\) .) Given feature vector \(\mathbf{x}\), we guess the label \(\hat{y} = \mathrm{argmax}_k f_k (\mathbf{x})\). We assume that exact ties never occur, or if they do, an arbitrary choice can be made. Note that if a multiple of 1 is added to \(\mathbf{b}\), the classifier does not change. Thus, without loss of generality, we can assume that \(\mathbf{1}^T \mathbf{b} = 0\).
To correctly classify the data examples, we need \(f_{y_i} (\mathbf{x}_i ) > \mathrm{max}_{k\neq y_i} f_k (\mathbf{x}_i )\) for all \(i\). This is a set of homogeneous strict inequalities in \(a_k\) and \(b_k\), which are feasible if and only if the set of nonstrict inequalities \(f_{y_i} (\mathbf{x}_i ) \geq 1 + \mathrm{max}_{k\neq y_i} f_k (\mathbf{x}_i )\) are feasible. This motivates the loss function: \[
L(A, \mathbf{b}) = \sum_{i=1}^m\left(1 + \mathrm{max}_{k \neq y_i}f_k(\mathbf{x}_i) - f_{y_i}(\mathbf{x}_i) \right)_+
\] where \((u)_+ = \mathrm{max}\{u, 0\}\). The multi-label SVM chooses \(A\) and \(\mathbf{b}\) to minimize \(L(A, b) + mu\|A\|_F^2\) , subject to \(\mathbf{1}^T \mathbf{b} = 0\), where \(\mu > 0\) is a regularization parameter. (Several variations on this are possible, such as regularizing b as well, or replacing the Frobenius norm squared with the sum of norms of the columns of A.). The Frobenius norm is a generalization of the \(2\)-norm from vectors to matrices, it is defined as \(\|A\|_F = \left(\sum_{ij} A_ij^2\right)^{1/2}\) and implemented in CVX using norm(A,'fro').
Show how to find \(A\) and \(\mathbf{b}\) using convex optimization. Be sure to justify any changes of variables or reformulation (if needed), and convexity of the objective and constraints in your formulation.
Carry out multi-label SVM on the data given in multi_label_svm_data.csv. Use the data given in \(\mathbf{X}\) and \(y\) to fit the SVM model, for a range of values of \(\mu\). Use the data given in multi_label_svm_test.csv to test the SVM models. Plot the test set classification error rate (i.e., the fraction of data examples in the test set for which \(\hat{y} \neq y\)) versus \(\mu\).
You don’t need to try more than 10 or 20 values of \(\mu\), and we suggest choosing them uniformly on a log scale, from (say) \(10^{−2}\) to \(10^2\) .
Problem 2: Maximum Likelihood Prediction of Team Abilities (Exercise 7.4 in Convex Optimization Extended Exercises)
Maximum likelihood prediction of team ability. A set of \(n\) teams compete in a tournament. We model each team’s ability by a number \(a_j \in [0, 1], j = 1, \cdots , n\). When teams \(j\) and \(k\) play each other, the probability that team \(j\) wins is equal to \(\mathrm{prob}(a_j − a_k + v > 0)\), where \(v \sim \mathrm{Normal}(0, \sigma^2 )\).
You are given the outcome of \(m\) past games. These are organized as \[ (j^{(i)} , k^{(i)} , y^{(i)} ),\quad i = 1, \dots , m\], meaning that game \(i\) was played between teams \(j^{(i)}\) and \(k^{(i)}\) ; \(y^{(i)} = 1\) means that team \(j^{(i)}\) won, while y (i) = −1 means that team k (i) won. (We assume there are no ties.)
- Formulate the problem of finding the maximum likelihood estimate of team abilities, \(\hat{a} \in \mathhb{R}^n\), given the outcomes, as a convex optimization problem. You will find the game incidence matrix \(A \in \mathbb{R}^{m\times n}\) , defined as
\[ A_{il} = \begin{cases} y^{(i)}\quad &\mathrm{if\,\,} l = j^{(i)} \\ -y^{(i)}\quad &\mathrm{if\,\,} l = k^{(i)} \\ 0 \quad &\mathrm{otherwise} \end{cases}, \] useful. The prior constraints \(\hat{a}_i \in [0, 1]\) should be included in the problem formulation. Also, we note that if a constant is added to all team abilities, there is no change in the probabilities of game outcomes. This means that \(\hat{a}\) is determined only up to a constant, like a potential. But this doesn’t affect the ML estimation problem, or any subsequent predictions made using the estimated parameters.
- Find \(\hat{a}\) for the team data given in team_data_train.csv. (This matrix gives the outcomes for a tournament in which each team plays each other team once.) You may find the CVX function
log_normcdfhelpful for this problem.
You can form A using the commands A = sparse(1:m,train(:,1),train(:,3),m,n) + … sparse(1:m,train(:,2),-train(:,3),m,n);
- Use the maximum likelihood estimate \(\hat{a}\) found in part (b) to predict the outcomes of next year’s tournament games, given in the file team_data_test.csv, using \(\hat{y}^{(i)} = \mathrm{sign}(\hat{a}_{j^{(i)}} − \hat{a}_{k^{(i)}})\). Compare these predictions with the actual outcomes, given in the third column of test. Give the fraction of correctly predicted outcomes. The games played in train and test are the same, so another, simpler method for predicting the outcomes in test it to just assume the team that won last year’s match will also win this year’s match. Give the percentage of correctly predicted outcomes using this simple method.
Problem 3: Flux balance analysis in systems biology. (Exercise 21.3 in CVX Additional Exercises)
Flux balance analysis is based on a very simple model of the reactions going on in a cell, keeping track only of the gross rate of consumption and production of various chemical species within the cell. Based on the known stoichiometry of the reactions, and known upper bounds on some of the reaction rates, we can compute bounds on the other reaction rates, or cell growth, for example.
We focus on \(m\) metabolites in a cell, labeled \(M_1\) , . . . , \(M_m\) . There are \(n\) reactions going on, labeled \(R_1\) , . . . , \(R_n\) , with nonnegative reaction rates \(v_1\) , . . . , \(v_n\) . Each reaction has a (known) stoichiometry, which tells us the rate of consumption and production of the metabolites per unit of reaction rate. The stoichiometry data is given by the stoichiometry matrix \(S \in \mathbb{R}^{m\times n}\) , defined as follows: \(S_{ij}\) is the rate of production of \(M_i\) due to unit reaction rate \(v_j = 1\). Here we consider consumption of a metabolite as negative production; so \(S_{ij} = −2\), for example, means that reaction \(\mathbb{R}^j\) causes metabolite \(M_i\) to be consumed at a rate \(2v_j\) .
As an example, suppose reaction \(R_1\) has the form \(M_1 \to M_2 + 2M_3\) . The consumption rate of \(M_1\) , due to this reaction, is \(v_1\) ; the production rate of \(M_2\) is \(v_1\) ; and the production rate of \(M_3\) is \(2v_1\) . (The reaction \(R_1\) has no effect on metabolites \(M_4\) , . . . , \(M_m\) .) This corresponds to a first column of \(S\) of the form \((−1, 1, 2, 0, \dots , 0)\).
Reactions are also used to model flow of metabolites into and out of the cell. For example, suppose that reaction \(R_2\) corresponds to the flow of metabolite \(M_1\) into the cell, with \(v_2\) giving the flow rate. This corresponds to a second column of \(S\) of the form \((1, 0, . . . , 0)\).
The last reaction, \(R_n\) , corresponds to biomass creation, or cell growth, so the reaction rate \(v_n\) is the cell growth rate. The last column of \(S\) gives the amounts of metabolites used or created per unit of cell growth rate. Since our reactions include metabolites entering or leaving the cell, as well as those converted to biomass within the cell, we have conservation of the metabolites, which can be expressed as \(Sv = 0\). In addition, we are given upper limits on some of the reaction rates, which we express as \(v \preceq v^{\mathrm{max}}\) , where we set \(v_{j}^{\mathrm{max} = \infty\) if no upper limit on reaction rate \(j\) is known. The goal is to find the maximum possible cell growth rate (i.e., largest possible value of \(v_n\) ) consistent with the constraints: \[ Sv = 0,\\ v \succeq 0,\\ v \preceq v^{\mathrm{max}} . \] The questions below pertain to the data found in fba_data.csv.
Find the maximum possible cell growth rate \(G^{\star}\) , as well as optimal Lagrange multipliers for the reaction rate limits. How sensitive is the maximum growth rate to the various reaction rate limits?
Essential genes and synthetic lethals. For simplicity, we’ll assume that each reaction is con- trolled by an associated gene, i.e., gene \(G_i\) controls reaction \(R_i\) . Knocking out a set of genes associated with some reactions has the effect of setting the reaction rates (or equivalently, the associated v max entries) to zero, which of course reduces the maximum possible growth rate. If the maximum growth rate becomes small enough or zero, it is reasonable to guess that knocking out the set of genes will kill the cell. An essential gene is one that when knocked out reduces the maximum growth rate below a given threshold \(G^{\mathrm{min}\) . (Note that \(G_n\) is always an essential gene.) A synthetic lethal is a pair of non-essential genes that when knocked out reduces the maximum growth rate below the threshold. Find all essential genes and synthetic lethals for the given problem instance, using the threshold \(G^{\mathrm{min}} = 0.2G^{\star}\) .